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. Abstract: The large sparse linear systems arising from the finite element or finite dif- 

' ference discretization of elliptic PDEs can be solved directly via, e.g., nested dissection 

^ I or multifrontal methods. Such techniques reorder the nodes in the grid to reduce the 

D I asymptotic complexity of Gaussian elimination from 0{N'^) to 0(A^^-^) for typical prob- 

. lems in two dimensions. It has recently been demonstrated that the complexity can be 

04 ' further reduced to 0{N) by exploiting structure in the dense matrices that arise in such 



< 



computations (using, e.g., 7^-matrix arithmetic). This paper demonstrates that such ac- 
celerated nested dissection techniques become particularly effective for boundary value 
problems without body loads when the solution is sought for several different sets of 
' boundary data, and the solution is required only near the boundary (as happens, e.g., in 

the computational modeling of scattering problems, or in engineering design of linearly 
elastic solids). In this case, a modified version of the accelerated nested dissection scheme 
^ . can execute any solve beyond the first in ©(A^boundary) operations, where A^boundary de- 

' notes the number of points on the boundary. Typically, A^boundary N . Numerical 



in 



examples demonstrate the effectiveness of the procedure for a broad range of elliptic 



fSJ ' PDEs that includes both the Laplace and Helmholtz equations. 

o ■ 



1. Introduction 

1.1. Problem formulation. This paper presents a fast solver for homogeneous boundary value problems 
(EVPs) of the form 

—Au{x) + b{x)u.j;{x) + c{x)uy(x) + d{x)u{x) ~ x G ft 

u{x) = g{x) X eT, 

where = [0, 1]^ is the unit square in R^, where F is the boundary of 51, and where 6, c, and d are functions 
on Q. We assume that the only information sought is the normal derivative of u at F. In other words, the 
objective is to construct an approximation to the Dirichlet-to-Neumann operator associated with the elliptic 
differential operator in (1). 

The proposed solver is particularly efficient for situations where (1) needs to be solved for a sequence of different 
boundary data functions g. The solver has two steps: (1) Build the approximate Dirichlet-to-Neumann operator 
for a given set of functions 6, c, and d. (2) Determine the Neumann data du/dn for any given Dirichlet data 
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g by applying the approximate Dirichlet-to-Neumann operator. The key claim of the paper is that the "build 
stage" can be executed in 0{N) operations, and the "solve stage" can be executed in 0{N^'^) operations. In 
contrast, classical nested dissection requires 0(iV^'^) and 0{N) operations for the two steps, respectively. 

1.2. Motivation. While the present paper addresses the specific BVP (1) on a simple square domain, the 
technique can be extended for building solution operators to elliptic boundary value problems of the form 



(2) 



Au{x) ^,f{x) a; e 

Bu{x) ^g(x) a; G r, 

where f2 is a domain in or M.^ with boundary F, where A is an elliptic partial differential operator, and where 
i? is a trace operator (representing boundary conditions like Dirichlet. Neumann, mixed, etc.). The goal of this 
paper is to illustrate that when the body load / is zero and the solution u and/or its derivatives are sought only 
near the boundary, the relevant solution operator can be constructed at moderate cost, and applied almost 
instantaneously. This opens up the possibility of high accuracy computational simulations to be carried out 
in real time for 3D problems such as elasticity involving composite materials, electrostatics in domains with 
variable conductivity, acoustic and electromagnetic scattering problems (at long and intermediate wave-lengths 
at least), and many others. 

In some applications such as seismic testing and automatic multilevel substructuring (AMLS), there is a small 
number of localized body loads inside of the domain. This paper details how the solution operator can be 
found with an increased cost which is still less than classic techniques. 



1.3. Discretization. The method described is applicable to a variety of geometries and discretization schemes 
(finite elements, finite differences, etc.). For simplicity of presentation, we restrict our attention to the model 
problem where a square domain is discretized via a finite difference scheme on a regular n x n square mesh. 
The resulting linear system takes the form 

(3) Au = b 

where A is an x sparse matrix. We let N = n? denote the total number of grid points. 

Example: When (1) represents the Laplace equation (b — c — d — 0) and the standard five-point finite difference stencil 
is used in the discretization, A consists of n x n blocks, each of size n x n, 



A = 



B 



I 







I B 



I B 



-I B 



where B = 



4 -1 
-1 4 
-1 







-1 



and where I is the n x n identity matrix. 
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1.4. Existing fast solvers. There already exist many efficient techniques for solving (3), including: 

Iterative methods: These techniques construct a sequence of successively more accurate approximate solu- 
tions by applying the matrix A to a sequence of vectors. Since the N x N matrix A has 0{N) non-zero entries, 
the resulting solver has 0{N) complexity whenever convergence is fast. It is difficult to predict the convergence 
rate of iterative methods and often a customized pre-conditioner is required to accelerate the schemes. 

Multigrid methods: These techniques can be viewed as a special case of iterative methods. They can in 
certain circumstances reach very high performance by decomposing the matrix in a sequence of different scales; 
since the matrix is well-conditioned on each scale, very fast convergence often results. 

Direct methods: Direct solvers (such as Gaussian elimination) which compute a solution in a single shot are 
considered more stable and robust than iterative methods. Proper ordering of the nodes often allows Gaussian 
elimination to be executed at 0{N^'^) complexity ([6]), and the resulting "nested dissection" approach is quite 
competitive for moderate problem sizes (up to about N ^ 10^). More recently, it has been shown that by 
exploiting additional structure in the coefficient matrix, the nested dissection method can be accelerated to 
(close to) linear complexity, see, e.g., [3, 10, 12]. 

1.5. Novelty of the present work and comparison to existing methodology. The proposed solver 
is based on the classical nested dissection algorithm of [6]. The key distinction to classical nested dissection 
is that special structure in the dense so called "frontal matrices" are exploited to reduce the cost of the 
pre-computation from 0{N^-^) to 0{N), and the cost of the solve from 0{N) to 0{N'^'^). To be precise, 
we approximate off-diagonal blocks of the dense frontal matrices by low-rank matrices; we do this using 
the structured matrix format described in [4, 8], which can be viewed as a variation of the well established 
"?^-matrix" and "H^-matrix" formats of Hackbusch and co-workers (see, e.g., [9, 1]). 

The observation that the dense matrix computations involving the frontal matrices can be accelerated using 
structured matrix algebra has recently been made in, e.g., [3, 10, 11, 12]. Our work is slightly different in that it 
is based on the hierarchical construction of Schur complements, and directly leads to a discrete approximation of 
the Dirichlet-to-Neumann operator on the full domain. This greatly simplifies the construction of the boundary- 
to-boundary solution operator. It also leads to algorithms that can readily handle a problem involving a sparse 
body load (see Section 7.3). 

While the present work considers only a regular square grid in 2D, the method can be extended to more general 
grids, and to 3D problems. A discussion of the expected performance in these cases can be found in Section 
8. Our key claims regarding asymptotic complexity are summarized in Table 1. 

An early version of the work reported appeared in the Ph.D. dissertation [7]. 

Remark 1.1. State-of-the-art iterative solvers such as, e.g., multigrid will sometimes outperform the new 
accelerated nested dissection technique for a stand-alone solve. However, the new solver is much faster for 
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subsequent solves; its asymptotic cost is only 0{N^^^) and its practical efficiency is such that a problem on a 
grid with 4000 x 4000 nodes can be solved in only 0.1 seconds on a standard office laptop. Moreover, since 
the solver is direct, it handles with ease many problems that are challenging for iterative methods (including 
multigrid), such as for instance vibration problems in situations where the domain is much larger than the 
wave-length. 





Build solution operators 


Solve with no body load 


Solve with general body load 


2D 




N^/^ (AT) 


iV (iVlogiV) 


3D 


iV4/3 (JV2) 


N (Ar4/3) 


NlogN (N'^^'-'logN) 



Table 1. Summary of asymptotic costs of the proposed direct solver in two and three 
dimensions. For comparison, the costs of classical nested dissection are given in parenthesis. 
For special (e.g. sparsely supported) body loads, better asymptotics than those listed in the 
table can be achieved, see Section 7.3. The asymptotics given for 3D problems are predictions 
for a simplistic generalization of the proposed scheme; it is likely that these numbers could be 
further improved. 

1.6. Outline of paper. The paper describes an 0{N) variation of the nested dissection method which com- 
putes the global Dirichlet-to-Neumann operator. Section 2 describes a hierarchical partitioning of the grid 
into a quad-tree of nested boxes. Section 3 describes a variation of the classical nested dissection technique 
that computes a hierarchy of solution operators for each box in the quad-tree. The solution operators have 
internal structure (see Section 4) which is exploited to improve the complexity of the method from 0{N^'^) 
to 0{N), see Sections 5 and 6. Section 7 reports the results of numerical experiments that substantiate our 
claims on the asymptotic complexity and accuracy of the method. 

2. Tree structure 

The direct solver described in this note is based on the classical nested dissection algorithm, and uses an 
analogous (but not identical) tree structure on the computational grid. This section formally defines the tree 
structure for our simple model geometry. 

Let Q denote the square domain introduced in Section 1, and suppose that it is discretized using a uniform 
n X n grid. Let N = denote the number of points in the grid, and let Mcaf denote a tuning parameter 
chosen so that a matrix of size Mcaf x Mcaf can be inverted quickly by brute force. The optimal choice of 
iVioaf depends on the computing environment, but we have found that Moaf = 4096 is often a good choice. Let 
L be the smallest integer such that when f2 is partitioned into 4^ equisized boxes, each box contains no more 
than TVieaf points. These 4^ small boxes are called the leaves of the tree. Merge the leaves by sets of fours 
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into boxes with twice the side length, to form the 4^~^ boxes that make up the next level in the tree. This 
process is repeated until 17 is recovered. We call ft the root of the tree. 

The set consisting of all boxes of the same size forms what we call a level. We label the levels using the integer 
£ = 0, 1, 2, . . . , L, with £ = denoting the root, and £ = L denoting the leaves. 

3. A VARIATION OF THE NESTED DISSECTION ALGORITHM 

This section describes a direct solver that is particularly fast for what we call "pure" boundary value problems 
such as (1) in which there is no body load, and where the solution is sought only near the boundary. The idea 
is to construct a solution operator G that maps the given boundary data to the sought potential values (or 
flows) on the boundary. Letting A^b denote the number of nodes on the boundary of the domain, G is a dense 
iVb X A^b matrix. 

Technically, the solution operator G is constructed via a divide-and-conquer approach (analogous to the one 
used in the classical nested dissection scheme): First a solution operator is constructed for each "leaf" in the 
quadtree described in Section 2, then solution operators for larger boxes are constructed via a hierarchical 
merging process in a single sweep through the tree, going from smaller to larger boxes. 

For a grid with N nodes, the process described in this section requires 0{N^'^) operations to construct the 
solution operator, and then each subsequent solve (which consists merely of applying the solution operator) 
requires 0{N) operations. Techniques for accelerating these two costs to 0{N) and 0(A°-^), respectively, are 
then described in Sections 4, 5 and 6. 

3.1. The solution operator and the Schur complement. This subsection provides a precise definition of 
the concept of a "solution operator" associated with a subdomain P of the computational grid. For simplicity, 
we assume that P is a square or rectangular domain. We partition P into interior nodes and boundary nodes: 

P^PiliPh, 

where Pi is defined as the set of nodes that have all four neighbors inside P, see Figure 1. (Note that the set 
P consists of all nodes at which the potential is unknown, and Pb is the outermost ring of these nodes, not 
the nodes at which Dirichlet data is prescribed.) 

Let Mb and denote the potentials at the boundary nodes and the interior nodes, respectively. Reordering 
the equilibrium equation (restricted to P), we find that Mb and Mj must satisfy 



(4) 


Ab,b Ab,i 




Mb 




/b 












Ai.b Ai i 




Mi 




/i 



Eliminating Mi from (4), we find 

SMb = /b - Ab.iAi^iVij 



(a) (b) 

Figure 1. (a) Labeling of nodes for constructing the Schur complement of a leaf. are 
the interior nodes (solid), and are the boundary nodes (hollow), (b) After the merge, 
all internal nodes are "eliminated" but now all nodes communicate directly (i.e. the Schur 
complement S is dense). 

where S is the matrix 

(5) S = Ab,b - Ab,i Ar;^ Ai.b. 

We refer to S as the Schur complement associated with the subdomain P; the solution operator is then G = S~^. 
In the case of no body loads fi = 0, thus the update to the right hand side on the boundary is not necessary. 




(a) (b) (c) 

Figure 2. (a) Labeling of nodes for the merge operation described in Section 3. The nodes 
in Pi and P3 arc round, and the nodes in P2 and P4 are square. The solid nodes are interior to 
the union of the two boxes fi^'^' and f2'°-'. (b) Connections between nodes before the merge, 
(c) Connections between nodes after eliminating the interior (solid) nodes. 

3.2. Merging two Schur complements. In this section, we present a technique for merging the Schur 
complements for two adjacent boxes. Let us call the two boxes ri*-^-* and (for "west" and "east"). Further, 
let p(^) and denote the nodes on the boundaries of these two boxes, and let S'^-' and 5'°-' denote Schur 
complements supported on these two sets of boundary nodes, see Figure 2 (a). 

The objective of the merge is to eliminate the nodes that are now "interior" to the larger box formed by the 
union of the two smaller boxes; these nodes are marked as blue in Figure 2 (b). To eliminate these points, we 
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first partition the nodes in P^'") and P''^^ so that 



(6) 



p(w) ^ p^up^^ and P^'') = Pa U P4, 



and so that Pi U P2 forms the boundary of the larger box, while the nodes in P3 U P4 are interior, see Figure 
2 (c). Partition the Schur complements S^™) and 5^°^ analogously: 



Sii S13 
S31 S33 



and S(°) 



S22 S24 
S42 S44 



Supposing that the right hand side has been updated to account for any interior body loads, equation (3) 
restricted to the union of the two boxes now reads 



Sii 


A12 


Sl3 


Ai4 


A21 


S22 


A23 


S24 


S31 


A32 


S33 


A34 


A41 


S24 


A43 


S44 _ 



(7) 



where Aij are the relevant sub- matrices of the original discrete Laplacian A. From (7), one finds that the Schur 
complement of the union box is 



Ml 




fl 


U2 




/2 


U3 




/3 


U4 




/4 



(8) 



S = 



Sii A12 
A21 S22 



Sl3 

A23 



Ai4 

S24 



S33 

A43 



A34 

S44 



S31 
A41 



A32 

S42 



The updated right hand side is 



fl 




fl 





Sl3 Ai4 
A23 S24 



S33 A34 
A43 S44 



/3 
fi 



Remark 3.1. The matrices A14, A23, A32, and A41 are typically very sparse. In fact, when equation (1) is 
discretized with a 5-point stencil, these matrices are identically zero. 



Remark 3.2. While the Schur complement is a dense matrix (cf. Figure 2), the interactions between distant 
points can to high precision be approximated by low rank matrices. This property can be conjectured by 
inspecting a computed Schur complement and observing that each row is smooth away from the diagonal. 
Figure (3) illustrates this point. The figure also illustrates that the Schur complement is strongly diagonal 
dominant, which is consistent with the fact that it is a discrete analog of the Dirichlct-to-Ncumann operator, 
which is a hyper-singular integral operator (it reduces the smoothness of any boundary function it operates 
on in a manner similar to a differentiation operator). 
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50 100 150 200 250 50 100 150 200 250 

(a) (b) 

Figure 3. (a) Three rows of a typical Schur complement for a Laplace problem with 252 
points on the boundary. The 50"^ (o), 100*^ (x), and 150"^ (□) rows are shown. Note how 
the matrix is completely dominated by the elements close to the diagonal, (b) The same plot 
but with a different scale on the vertical axis to show the smoothness in the far-field. 

3.3. The full algorithm. For future reference, let us summarize the algorithm described: 

(1) Construct a quad-tree: Partition the grid into a hierarchy of boxes as described in Section 2. 

(2) Process the leaves: For each leaf box in the tree, construct its Schur complement as described in 
Section 3.1. If the box has body loads, update the right hand side. 

(3) Hierarchical merge: Loop over all levels of the tree, from finer to smaller. For each box on a level, 
compute its Schur complement by merging the (already computed) Schur complements of its children. 
Note that the merge of four children can be executed via three of the pair-wise merges described in 
Section 3.2. If the interior points have body loads, update the right hand side. 

(4) Process the root of the tree: After completing Step 3, the Schur complement for the entire domain is 
available. Invert (or factor) it to construct the solution operator. 

Remark 3.3. For simplicity, the algorithm is described in a level-by-level manner (process all leaves first, 
then proceed one level at a time in going upwards). In fact, there is flexibility to travel through the tree 
in any order that ensures that no node is processed before its children. Since all Schur complements can be 
discarded once their information has been passed on to a parent, smarter orderings can greatly reduce the 
memory requirements [5]. 

3.4. Asymptotic complexity of the algorithm. As before, let N — denote the total number of points 
in the grid, let iVicaf denote the maximum number of points on a leaf, and let L denote the number of levels 
so that TV < 4^ Meaf • 
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The cost to process one leaf in Step 2 in Section 3.3 is 0{N^^^f) (exploiting that the matrix Aii in (5) is 
band-diagonal). Since there are 4^ leaves, the total cost of Step 2 is therefore 4^ ^loaf ^ -^Moaf- Since A^ioaf 
is a small constant number (in principle one could set A^ioaf = 1) the leaf processing cost is 0{N). 

Next consider the cost of constructing the Schur complement of a box on level £ in executing Step 3 in Section 
3.3. Note that all boxes involved have 0{n2~^) points along each side. Since some matrices in (8) are dense, 
the cost for each merge is proportional to {n 2^^)"^ = 2^^^. Since we need to compute 2^^ Schur complements 
on level £, the total cost of Step 3 is then J2e=i 2-3« = „3 ^i^^ ~ n^. 

Since the cost of the final inversion/factorization in Step 4 is 0{n^), the total cost of the algorithm in Section 
3.3 is 0{n^) = 0{N^-^). 



4. Compressible matrices 

To improve the scaling of the nested dissection method, a more efficient technique for evaluating (8) will be 
implemented. We will exploit that while the matrices Sij are all dense, they in the present context have 
additional structure: is when i =/= j to high precision rank deficient, and Su has a structure that we call 
Hierarchically Block Separable (HBS). This section briefly describes the HBS property, for details see [8]. We 
note that the HBS property is very similar to the concept of Hierarchically Semi- Separable (HSS) matrices 
[13, 2] which has previously been used in an analogous context [3]. Other researchers have used the somewhat 
related H-matrix concept for similar purposes [10, 12]. 

4.1. Block separable. Let H be an mp x mp matrix that is blocked into p x p blocks, each of size m x m. 

We say that H is "block separable" with "block-rank" k if for r = 1, 2, . . . , p, there exist m x k matrices Ur 
and Vt such that each off-diagonal block Ho-.r of H admits the factorization 

H<x,r = H,,,^ V*, cr,r e {1, 2, p}, a 7^ r. 

m X m m X k k x k k x m 

Observe that the columns of Do- must form a basis for the columns of all off-diagonal blocks in row cr, and 
analogously, the columns of V,- must form a basis for the rows in all the off-diagonal blocks in column r. When 
(9) holds, the matrix H admits a block factorization 

H = U H V* -t- D, 

(10) 

mp X mp mp x kp kp x kp kp x mp mp x mp 

where 

U = diag(Ui, U2, . . . , Up), V = diag(Vi, V2, . . . , Vp), D = diag(Di, D2, . . . , Dp), 



(9) 
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and 








H12 


H13 




H21 





H23 


H ^ 










H31 


H32 






4.2. Heirarchically Block-Separable. Informally speaking, a matrix H is Heirarchically Block- Separable 
(HBS), if it is amenable to a telescoping block factorization. In other words, in addition to the matrix H being 
block separable, so is H once it has been reblocked to form a matrix with p/2 x p/2 blocks. Likewise, the 
middle matrix from the block separable factorization of H will be block separable, etc. 

In this section, we describe properties and the factored representation of HBS matrices. Details on constructing 
the factorization are provided in [8]. 



4.2.1. A binary tree structure. The HBS representation of an M x M matrix H is based on a partition of the 
index vector / = [1, 2, . . . , M] into a binary tree structure. We let / form the root of the tree, and give it the 
index I, Ii = I. We next split the root into two roughly equi-sized vectors I2 and /3 so that Ii ~ I2U I3. The 
full tree is then formed by continuing to subdivide any interval that holds more than some preset fixed number 
m of indices. We use the integers £ = 0, 1, . . . , L to label the different levels, with denoting the coarsest 
level. A leaf is a node corresponding to a vector that never got split. For a non-leaf node r, its children are 
the two boxes ai and CT2 such that 1^ = I^-^ U 7^2, and r is then the parent of ci and CT2. Two boxes with the 
same parent are called siblings. These definitions are illustrated in Figure 4. 



4.3. Definition of the HBS property. We now define what it means for an M x M matrix H to be 
hierarchically block separable with respect to a given binary tree T that partitions the index vector J ~ 
[1, 2, . . . , M]. For simplicity, we suppose that for every leaf node r the index vector holds precisely m 
points, so that M = m2^. Then H is HBS with block rank k if the following two conditions hold: 
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V, 


m X k 


Basis for the rows in the blocks in column r. 


For each parent 




2k X 2k 


Interactions between the children of r. 


node r: 




2k xk 


Basis for the columns in the (reduced) blocks in row r. 






2k xk 


Basis for the rows in the (reduced) blocks in column r. 



Figure 5. An HBS matrix H associated with a tree T is fully specified if the factors listed 
above are provided. 

(I) Assumption on ranks of off-diagonal blocks at the finest level: For any two distinct leaf nodes r and r', 
define the n x n matrix 

(II) Hr,r' = H{IrJr')- 

Then there must exist matrices Ut, Vt-/, and H^ ,-/ such that 

Hr^r' = Ht^t-' V*,. 

m X m m X k k x k k x m 



(12) 



(2) Assumption on ranks of off-diagonal blocks on level £ = i — 1, L — 2, . . . , 1; The rank assumption at level 
i is defined in terms of the blocks constructed on the next finer level £ For any distinct nodes r and r' on 
level £ with children cri , CT2 and a'i,a2, respectively, define 



CJi jCT-j^ ' ' CJi ,(72 



(13) hr,r' = 

Then there must exist matrices U,-, V,-/, and Hr,T' such that 
(14) 



Ht,t' — Ur ^t,t' V*,. 

2k X 2k 2k X k k x 



An HBS matrix is now fully described if the basis matrices Ur and Vt- are provided for each node r, and in 
addition, we are for each leaf r given the m x m matrix 

(15) Dr = H{U,Ir), 

and for each parent node r with children tJi and a2 we are given the 2k x 2k matrix 

H, 



(16) 



Br = 







Observe in particular that the matrices Ha^-^ u^ are only required when {(Ti,a2} forms a sibling pair. Figure 5 
summarizes the required matrices. 





= diag(D^ : 


r is a box on level £), 


£ = 0, 1, .. 


.,L, 




= diag(Ur : 


T is a box on level £), 


£= 1, 2, .. 


., L, 




= diag(V^ : 


T is a box on level £), 


£=1,2,.. 


., L, 




= diag(B^ : 


T is a box on level £), 


£ = 0,l,.. 


., L-1, 



4.4. Telescoping factorization. Given the matrices defined in the previous section, we define the following 
block diagonal factors: 

(17) 
(18) 
(19) 
(20) 

Furthermore, we let H^^^ denote the block matrix whose diagonal blocks are zero, and whose off-diagonal blocks 
arc the blocks Hi-,t' for all distinct r, r' on level £. With these definitions, 

(21) - ' 

m2^xn2^ m2^xfc2^ k2^ x k2^ fc2^xm2^ m2^ x m2^ 

ior £ = L — 1, L — 2, . . . , 1 we have 

H(«+i) = nW fV(^))* + B^^)- 

(22) - ' 

k 2^+1 X k 2^+1 k 2'+^ xk2' k2'xk2' k2' x k 2^+^ k 2^+^ x k 2'+^ 

and finally 

(23) H« = B(o). 

5. Fast arithmetic operations on HBS matrices 

Arithmetic operations involving dense HBS matrices of size N x N can often be executed in 0{N) operations. 
This fast matrix algebra is vital for achieving linear complexity in our direct solver. This section provides 
a brief introduction to the HBS matrix algebra. We describe the operations wc need (inversion, addition, 
and low-rank update) in some detail for the single level "block separable" format. The generalization to the 
multi-level "hierarchically block separable" format is briefly described for the case of matrix inversion. A full 
description of all algorithms required is given in [7] , which is related to the earlier work [4] . 

Before we start, we recall that a block separable matrix H consisting of p x p blocks, each of size m x m, and 
with "HBS-rank" k < m, admits the factorization 

H = U H V* + D. 

(24) 

mp X mp mp x kp kp x kp kp x mp mp x mp 

5.1. Inversion of a block separable matrix. The decomposition (24) represents H as a sum of one term 
UHV* that is "low rank," and one term D that is easily invertible (since it is block diagonal). By modifying 
the classical Woodbury formula for inversion of a matrix perturbed by the addition of a low-rank term, it can 
be shown that (see Lemma 3.1 of [8]) 

(25) = E(H-f D)-iF*-hG, 



13 

where 

(26) D = (V* D-^ Uy\ 

(27) E=D-iUD, 

(28) F=(DV*D"i)*, 

(29) G= -D-iUDV*D-\ 

assuming the inverses in formulas (25) — (29) ah exist. Now observe that the matrices D, E, F, and G can 
aU easily be computed since the formulas defining them involve only block-diagonal matrices. In consequence, 
(25) reduces the task of inverting the big (size mp x mp) matrix H to the task of inverting the small (size 
kp X kp) matrix H + D. 

When H is not only "block separable", but "hierarchically block separable", the process can be repeated 
recursively by exploiting that H + D is itself amenable to accelerated inversion, etc. The resulting process 
is somewhat tricky to analyze, but leads to very clean codes. To illustrate, we include Algorithm 1 which 
shows the multi-level 0{N) inversion algorithm for an HBS matrix H. The algorithm takes as input the 
factors {Ur, V.^, D.^, Qt}t representing H (cf. Figure 5), and outputs an analogous set of factors {E,-, F^, G^}^ 
representing H^^. With these factors, the matrix- vector multiplication y = H^^x can be executed via the 
procedure described in Algorithm 2. 

5.2. Addition of two block separable matrices. Let H'^ and be block separable matrices with factor- 
izations 

= U^H^V^* + D^, and = U^H^V^* + D^. 

Then H = H"'^ + can be written in block separable form via 





To restore (30) to block separable form, permute the rows and columns of [U'^ U^] and [V'^V-^] to attain 
block diagonal form, then re-orthogonalize the diagonal blocks. This process in principle results in a matrix H 
whose HBS-rank is the sum of the HBS-ranks of H"^ and H^. In practice, this rank increase can be combatted 
by numerically recompressing the basis matrices, and updating the middle factor as needed. For details, as 
well as the extension to a multi-level scheme, see [4, 7]. 

5.3. Addition of a block separable matrix with a low rank matrix. Let = QR be a fc-rank matrix 
where Q and R* are of size mp x k. We would like to add to the block separable matrix H"^. Since we 
already know how to add two block separable matrices, we choose to rewrite in block separable form. 
Without loss of generality, assume Q is orthogonal. Partition Q into p blocks of size m x k. The blocks 
make up the matrix U^. Likewise partition R into p blocks of size k x m. The block matrix has entries 



(30) H = H^ + H^= [U^U^] 



[V^V^]* + (D^ + D^ 
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Algorithm 1 (inversion of an HBS matrix) 


Given 


factors {Ur, V^, D^, B^jr representing an HBS matrix H, this algorithm constructs factors 


{Er, Fr, Gt}t representing H ^. 




loop over all levels, finer to coarser, I 


= L, i-l, 


loop over all boxes r on level I, 






if r is a leaf node 












else 










Let (Ti 


and (72 denote the children of r. 








^02,^1 DcT2 






end if 










= (v;d-iuO 






E, = Dr. 












G. = - 






1 


end loop 








end loop 








Gi = 


D2 62,3 
63,2 D3 




-1 





Dt = QtRt for r = 1, . . . ,p. To construct the matrices V^, for each t = 1, . . . ,p, the matrix Rt- is factorized 
into Rt-Vt* where the matrix V,- is orthogonal. The matrices Rt- make up the entries of H^. 



6. Accelerating the nested dissection algorithm 



In this section, we apply the structured matrix techniques introduced in Sections 4 and 5 to reduce the 
complexity of the solver of Section 3 from 0{N^-^) to 0{N). The key task that we need to accelerate is the 
construction of the Schur complement for a parent box from the Schur complements of its two children. The 
formula that needs to be evaluated is, cf. (8), 



(31) 



Six A12 
A21 S22 



S13 Ai4 

A23 S24 



S33 A34 
A43 S44 



S31 A32 
A41 S42 



The acceleration can be broken into three steps which utilize important properties of each submatrix. 
Step 1 The inverse in equation (31) never needs to be constructed. Instead the solution of 



(32) 


S33 


A34 




X3 




Z3 












A43 


S44 




X4 




Z4 
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Algorithm 2 (application of inverse) 
Given X, compute y = H^^a; using the compressed representation o/H^^ resulting from Algorithm 1. 
loop over all leaf boxes r 

end loop 

loop over all levels, finer to coarser, £ — L, L ~ 1, . . . , 1 
loop over all parent boxes r on level £, 

Let ai and (72 denote the children of r. 



end loop 
end loop 



X(j 

Xq 



V2 




*2 


= Gi 








X3 



loop over all levels, coarser to finer, £ = 1, 2, . . 
loop over all parent boxes r on level £ 

Let (Ti and 02 denote the children of r. 



Xq 



end loop 
end loop 

loop over all leaf boxes r 

V{It) = ErqT + Gt x{Ir) 

end loop 



can be found by rapidly via a block solve. Then 



X3 
X4 



is given by 



X4 — (S44 — A43S33^A34) (Z4 — A43S33^Z3) 



and 



X3 — 533^73 — S33^A34X4 



Since S33 is HBS, an approximation of its inverse can be constructed and applied rapidly. The 
matrix A43S33^A34 is also HBS, since A43 and A34 arc anti-diagonal matrices (ie. all the entries are 
zero except those on the diagonal going from the lovifer left corner to the upper right corner). Hence 
S44 — A43S33^A34 can be added quickly. The resulting matrix is HBS and can be inverted with linear 
scaling computational cost. 
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Let 



X31 X; 

X41 X, 



32 



42 



denote the result of applying the block inverse to 



S31 A 



Ml 



32 



342 



Step 2 The matrices S13, and S24 are low rank, thus we can rewrite the matrices in their low rank factored 
form as L13R13 and L24R24. Using this notation, the second term in (31) can be expressed in a low 
rank factored form 



(33) 



Sl3 Ai4 

A23 S24 



X31 X32 
X41 X42 



L13R13X31 + A14X41 L13R13X32 + A14X42 
A23X31 + L24R24X41 A23X32 + L24R24X42 



Since all the matrices kjk are very sparse, the four blocks on the right hand side of (33) are of low 
rank. (Recall in the case of the five point stencil, the matrices A14 and A23 are zero). 
LiiRii L12R12 

Let denote the low rank factorization of the blocks in (33). 

L21R21 L22R22 

Step 3 Now we add the two terms that comprise the Schur complement 

Sii A12 LiiRii L12R12 

A21 S22 L21R21 L22R22 

The diagonal block entries are HBS + low rank which is computed via Algorithms 3 and 2. The 
off-diagonal blocks are low rank with a very sparse update which is also low rank. The result is one 
HBS matrix. 



Remark 6.1. As a practical matter, structured matrix algebra should not be introduced until the Schur 
complements get fairly large (roughly of size 1000 x 1000 or so). This means that at the lower levels, formula 
(31) is evaluated using dense matrix algebra for all matrices Sij. 



7. Numerical experiments 



In this section, we illustrate the capabilities of the proposed method for constructing solution operators for 
problems of the form 

{-AM(a;) + h{x)ux{x) + c{x)uy{x) + d{x)u{x) = /(a;), a; e = [0, 1]^, 
u{x) = g{x), a; G F, 

where h, c, d, and / are functions defined on VL, and the boundary data g is defined on F. Section 7.1 
investigates the asymptotic complexity of the direct solver for several different differential operators (Laplace, 
Hclmholtz, convection-diffusion, etc.) for the case where the body load is zero (/ = 0). It also reports on the 
accuracy for each case. Section 7.2 reports the execution times of the build and the solve stages of the direct 
solver. Section 7.3 reports on the performance of the method for a problem with localized body loads. 

For all problems, the domain is discretized with a uniform grid of n x n points so that the grid spacing is 
/i = l/(n — 1). We let N = n'^ denote the total number of nodes. Equation (34) is discretized with the finite 
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difference scheme corresponding to the five point stencil. For example, when a node k is in the interior of il, 
the discretization of the differential operator in (34) is 

[Au{k) ~ u{kn) - u{ks) - u{k^) - u(fce)] + 



where fee, fc„, fc^, kg denote the grid points to the "east", "north", "west", and "south" of k, respectively. The 
HBS matrix algebra was run with a local tolerance of e = 10^^. 

All experiments are executed on a Lenovo laptop computer with a 2.4GHz Intel 15 processor and 8GB of RAM. 
The method was implemented in Matlab, which we judged adequate since the main purpose of the experiments 
is to substantiate our claims in regards to asymptotic complexity. It should be noted, however, that even this 
non-optimized code runs quite fast, see, e.g.. Table 4. 

Recall that the approximate solution (or Dirichlet-to-Neumann) operator G is the inverse of the Schur com- 
plement S for the full domain, see Section 3.1. 

7.1. Range of problems v^fith optimal scaling. The proposed method for constructing Dirichlet-to-Neumann 
operators has been applied to several problems to investigate its asymptotic complexity. The problems are: 

• Laplace: Let b{x) = c{x) = d{x) = 0. 

• Dijfusion- Convection I: Let c{x) = d{x) = and the convection in the x direction be constant: 
h{x) = 100. 

• Diffusion-Convection II: Same as Diffusion-Convection /, but with h{x) — 1000. 

• Diffusion- convection III: Introduce a divergence free convection field by setting b{x) = 125cos(47ry) 
and c{x) = 125 sin(47ra;), and d{x) = 0. 

• Diffusion-convection IV: Introduce a convection field with sources and sinks by setting Let b{x) — 
125 cos(47r2;), c{x) ~ 125 sin(47r?/), and d{x) — 0. 

• Helmholtz I: Consider the Helmholtz equation corresponding to a domain that is roughly 1.5 x 1.5 
wavelengths large: b{x) ~ c{x) ~ and d{x) = —100. 

• Helmholtz II: Consider the Helmholtz equation corresponding to a domain that is roughly 10 x 10 
wavelengths large: b{x) — c{x) — and d{x) = —4005. 

• Helmholtz III: Consider the Helmholtz equation near a resonance: b{x) — c{x) — and d{x) = 
— Aio + 10^^, where Aio is the tenth eigenvalue of the discrete Laplace operator (note that these are 
known analytically). 

• Helmholtz IV: Consider a sequence of Helmholtz problems where the wave-number is increased to keep 
a constant 40 points per wave-length: b{x) = c{x) ~ and dix) = — (^^)^. 

• Random Laplacian I: Let the matrix A reflect an elliptic equilibrium problem on a network instead of 
a continuum PDE. In this case, the network is the square grid where each link is assigned a random 
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conductivity between varying between 1 and 2. The potential at any single node is the weighted 
average of the potentials of its four neighbors, where the weights are the conductivities. 
• Random Laplacian II: Same as Random Laplacian I, but now the conductivities vary between 1 and 
1000. 

Table 2 reports the amount of memory M (n) in MB required to store the Dirichlet-to-Neumann operator G 
for each problem; it also reports the fraction M{n)/n. Our claim in regards to compressibility amounts to 
a prediction that M{n)/n will remain stable as n grows for all problems, except for Hclmholtz IV. Table 2 
demonstrates that this scaling holds true in the range n = [256, 512, 1024, 2048]. 

Table 3 reports two errors measured on a grid of size 1024 x 1024: 
ei - the relative P-erroi in the vector Gr where r is a unit vector of random direction 

62 - the relative Z^-error in the vector G r where r is smooth. 
The exact value of G r was found by using GMRES to solve the full original linear system Ax = r, where f is 

a vector of length such that r\r — r and r|s^/p = 0. A slight loss in accuracy is observed for Hclmholtz I, 

IV and Random II problems. There is a substantial loss in accuracy for the Hclmholtz HI problem. This is to 

be expected since the matrix A is close to being numerically singular. 



Problem 


n = 256 


n ^ 512 


n = 1024 


n = 2048 


Laplace 


0.83 (3.2C-3) 


1.62 (3.2e-3) 


3.18 (3.1e-3) 


6.27 (3.1e-3) 


DiffConv I 


0.91 (3.5C-3) 


1.75 (3.4e-3) 


3.32 (3.2c-3) 


6.52 (3.2c-3) 


DiffConv II 


1.10 (4.3e-3) 


1.84 (3.6e-3) 


3.62 (3.5C-3) 


6.87 (3.4e-3) 


DiffConv III 


0.86 (3.4C-3) 


1.70 (3.3e-3) 


3.32 (3.2e-3) 


6.55 (3.3e-3) 


DiffConv IV 


0.97 (3.8C-3) 


1.83 (3.6e-3) 


3.43 (3.3e-3) 


6.59 (3.2e-3) 


Hclmholtz I 


0.86 (3.4e-3) 


1.67 (3.3e-3) 


3.25 (3.2e-3) 


6.34 (3.1e-3) 


Hclmholtz II 


1.04 (4.1e-3) 


1.91 (3.7e-3) 


3.56 (3.5e-3) 


6.78 (3.3e-3) 


Hclmholtz III 


0.86 (3.4e-3) 


1.67 (3.3e-3) 


3.29 (3.2e-3) 


6.42 (3.1e-3) 


Hclmholtz IV 


0.89 (3.5C-3) 


1.74 (3.4e-3) 


3.59 (3.5e-3) 


7.89 (3.9e-3) 


Random I 


0.83 (3.2C-3) 


1.64 (3.2e-3) 


3.22 (3.1e-3) 


6.34 (3.1e-3) 


Random II 


0.82 (3.2e-3) 


1.64 (3.2e-3) 


3.23 (3.2e-3) 


6.36 (3.1e-3) 



Table 2. Memory M{n) in MB required to store the solution operator for the problems 
listed in Section 7.1. The quantity ^^l^^ is reported in parenthesis. 

7.2. Performance. In this section we report the computational times required for the experiments that were 
described in Section 7.1, for grids of size 512 x 512 to 4096 x 4096. Since the times were very similar across 
most experiments, we report only those for Laplace, which represents the "typical" times observed, and for 
Hclmholtz IV, which was the most challenging and slowest of all the experiments conducted. Table 4 reports: 
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Problem 


ei 


62 


Laplace 


6.3e-7 


3.6e-7 


DiffConv I 


1.5e-6 


1.3e-6 


DiffConv II 


8.7e-6 


8.2e-6 


DiffConv III 


5.6e-7 


3.4e-7 


DiffConv IV 


4.1e-8 


4.1e-8 


Helmholtz I 


1.4e-4 


4.8e-7 


Helmholtz II 


l.le-6 


5.1e-6 


Helmholtz III 


1.2e-5 


5.7e-4 


Helmholtz IV 


8.2e-4 


1.2e-3 


Random I 


1.8e-7 


1.2e-7 


Random 11 


1.4e-5 


8.1e-6 



Table 3. Errors ei and 62 for the solution operator for the problems listed in Section 7.1. 



Tbuiid - the time in seconds for constructing the Dirichlet-to-Neumann operator, 

Tsoive - the time in seconds for applying the Dirichlct-to-Neumann operator to a vector. 
Let us draw the reader's attention to some interesting results in Table 4: 

• A principal claim we make in terms of performance of the proposed method is that after an initial 
pre-computation in which the solution operator is built, the time Tsoivc required to process a new 
vector of Dirichlet data is small. Table 4 clearly bears out our claim that Tsoivc scales linearly with 
the number of points on the boundary, in other words Tsoivc ^ N^'^ . Moreover, the scaling constant 
turns out to be small, for a grid of size 4096 x 4096, the solve time is only 0.1 seconds. 

• The other key claim made is that the time to build the solution operator in the first place scales linearly 
with the number of points in the grid, in other words Tbuiid ~ N . Table 4 shows that for grids holding 
between 250k and 16M nodes, the build time in fact scales sub-linearly. Eventually, linear complexity 
must of course kick in, but it is interesting that it has not yet done so even for a grid holding over 
16M nodes. 

• For the example Helmholtz IV we did not predict linear complexity. This problem models wave- 
propagation in such a way that as n grows, the number of wave-lengths along a side of the domain 
grows proportionally. This will eventually destroy the rank-structure in the Schur complements that 
we rely on to reduce the 0{N^-^) scaling of classical nested dissection down to 0{N). Now what is 
interesting is that while the predicted complexity is Tbuiid ^ iV^'^, the observed complexity is only 
Tbuiid ^ N. We expect that the predicted asymptotic scaling will eventually assert itself, but it is to 
us remarkable that it has not yet done so given that the largest domain with 16M nodes represents a 
physical problem of size 100 x 100 wavelengths. 



20 



N 


Laplace 


Helmholtz IV 




Tbuild 

(sec) 


^solvc 

(sec) 


Tbuild 

(sec) 


^solve 

(sec) 


512^ 


13.44 


0.013 


50.78 


0.013 


10242 


45.25 


0.027 


193.58 


0.027 


2048^ 


135.01 


0.058 


765.35 


0.056 


4096^ 


450.73 


0.107 


3167.56 


0.115 



Table 4. Times for the approximation of the Dirichlet-to- Neumann operator for the Laplace 
and Helmholtz IV problems via the accelerated nested dissection method. 



7.3. Performance with body loads. In this experiment, the Random Laplacian I problem is solved in a 
situation with a non-zero body load /. We assume however that the body load is restricted to a small number 
-^body of nodes in the interior. The locations of these nodes is assumed to be fixed. Our objective is now to 
construct a solution operator that constructs the vector of fluxes v on the boundary (the discrete "Neumann 
data") given a vector g of Dirichlct data, and a vector / G K^body of body loads at the pre-scribed nodes. 
This solution operator has two terms as follows: 

G 9 + F /, 

-^boundary ^ -^boundary -^boundary ^ f -^boundary -^body -^body ^ f 

where A^boundaiy = 4(n — 1) denotes the number of points on the boundary. The matrix G is our by now 
familiar discrete Dirichlct-to-Neumann operator; it is constructed in HBS form. The matrix F is a new 
solution operator that maps the interior body load to boundary fluxes. Since A'body is small, the matrix F is 
built in uncompressed form, and then approximated by a low-rank factorization. 

Table 5 reports the computational times required to build the solution operators F and G in (35) for several 
different grid sizes and different values of A'body The table also reports the relative errors when A^body random 
body loads are placed in a localized area inside the domain. Notice that for a small number of body loads the 
cost is close to that of solving a pure boundary value problem. As expected the computational cost grows as 
the number of body loads is increased. 

Remark 7.1. In this section, we considered a special case where the body load / is restricted to a small 
number of internal nodes. For the general case where the body load / is supported on the entire domain, 
solution operators like (35) can still be constructed. In this case, the matrix F is of size iVboundary x N, and 
should be constructed in a data-sparse format analogous to the HBS format. This operator can be both built 
and applied in 0{N) operations. 



(35) 



-^boundary ^ ^ 
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TV 


Mjody 


Tbuild 


^solvc 


frcl 




10 


13.09 


0.013 


1.02C-6 


5122 


100 


13.25 


0.013 


4.55e-7 




1000 


43.21 


0.015 


3.42e-7 




10 


47.33 


0.027 


1.23e-6 


10242 


100 


48.89 


0.027 


6.46e-7 




1000 


163.05 


0.029 


4.35e-7 




10 


256.57 


0.55 




20482 


100 


268.27 


0.58 






1000 


713.55 


0.059 





Table 5 . Times for building the solution operators and applying the Dirichlet-to- Neumann 
operator when iVbody body loads arc radomnly distributed in the domain. The relative error 
Eroiin the solution is also reported. 

8. Conclusions and generalizations 

This paper presents a fast method for constructing the Dirichlet-to-Neumann operator for elliptic problems with 
no body loads. Numerical results indicate that the method scales linearly with the number of discretization 
points N for a variety of problems. Since application of the solution operator scales linearly with the number of 
boundary points (typically 0{N^/^)), constructing the solution for multiple right-hand sides is essentially free 
once the Dirichlet-to-Neumann operator is built. For a problem involving approximately 16 million unknowns, 
it takes about 8 minutes to build the solution operator, and 0.1 seconds to apply it to a right-hand side. 

The fast direct solver described here relies on the intermediate dense matrices being compressible in the sense 
of being either of low rank, or having the HBS structure described in Section 4. It is currently not well 
understood exactly when this holds, but the numerical experiments in Section 7 indicate that the property is 
remarkably stable across a broad range of test problems. 

In the interest of concision, this paper considered only an operator discrctized by a five-point stencil on a regular 
square grid. However, the scheme does not inherently depend on the special form of either the stencil or the 
grid. We expect that the generalization to other domains and other discretizations in 2D should in principle be 
unproblematic, as long as the computational stencil is not too large. (Fast construction of LU-decompositions 
of the stiffness operator on somewhat general grids is reported in [12].) 

The scheme can also be generalized to problems in three dimensions; the simplistic implementation described 
here would have 0{N^^'^) complexity for the build stage, and 0{N^^^) complexity for the solve stage. Given 
that classical nested dissection in 3D has complexity 0{N^) and 0{N'^^^) for the build and solve stages, this is 
a substantial gain, especially for the solve stage. In principle, one could build a scheme that uses accelerated 
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matrix algebra internally inside the HBS representation to attain 0{N) complexity, but this would require 
significant work beyond that described in this paper. 
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